Introduction to Multiple Linear Regression

Before…

Now…



How?

Offsets!

smoke_lm <- lm(weight ~ weeks * habit, data = ncbirths)

get_regression_table(smoke_lm)
# A tibble: 4 × 3
  term              estimate std_error
  <chr>                <dbl>     <dbl>
1 intercept           -5.94      0.484
2 weeks                0.341     0.013
3 habit: smoker       -1.86      1.63 
4 weeks:habitsmoker    0.039     0.042

The `*` means the variables are interacting! Each level of `habit` gets its own intercept offset *and* its own slope offset.


What is the regression equation for non-smoker mothers?

What is the regression equation for smoker mothers?
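Reading the offsets from the table above (non-smokers are the baseline; the smoker rows shift both the intercept and the slope):

\[\widehat{weight}_{nonsmoker} = -5.94 + 0.341 \times \text{weeks}\]

\[\widehat{weight}_{smoker} = (-5.94 - 1.86) + (0.341 + 0.039) \times \text{weeks} = -7.80 + 0.380 \times \text{weeks}\]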

What if we have a second numerical explanatory variable?

Multiple slopes

age_lm <- lm(weight ~ weeks + mage, data = ncbirths)

get_regression_table(age_lm)
# A tibble: 3 × 3
  term      estimate std_error
  <chr>        <dbl>     <dbl>
1 intercept   -6.68      0.492
2 weeks        0.346     0.012
3 mage         0.02      0.006


How do you interpret the value of 0.346?

How do you interpret the value of 0.02?
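For reference, the fitted additive equation, read straight off the table:

\[\widehat{weight} = -6.68 + 0.346 \times \text{weeks} + 0.02 \times \text{mage}\]

Each slope is the expected change in birth weight for a one-unit increase in that variable, *holding the other variable constant*.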

But how do we decide if the interaction model is “best” without a p-value?!

When investigating if a relationship differs…

Always start with the “interaction” / different slopes model.

If the slopes look different, you’re done!

If the slopes look similar, then fit the “additive” / parallel slopes model.

Different Enough?

Behind the Plot

ggplot(data = MA_schools, 
       mapping = aes(y = average_sat_math, 
                     x = perc_disadvan, 
                     color = size)) + 
  geom_point() +
  geom_smooth(method = "lm") + 
  labs(x = "Percent Economically Disadvantaged", 
       y = "Average SAT Math", 
       color = "Size of School")


Because `color = size` is mapped, geom_smooth(method = "lm") fits a separate line for each school size — allowing both the intercepts and the slopes to differ

What about now?
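The six-term table below matches the structure of the interaction model, `lm(average_sat_math ~ perc_disadvan * size, data = MA_schools)`. Here is a self-contained sketch of that structure on simulated stand-in data (the real `MA_schools` data comes from the moderndive package):

```r
# The six-term table on this slide matches the interaction model
#   lm(average_sat_math ~ perc_disadvan * size, data = MA_schools)
# Sketch on simulated stand-in data, just to show the model structure:
set.seed(1)
d <- data.frame(
  perc_disadvan = runif(300, min = 0, max = 90),
  size = factor(sample(c("small", "medium", "large"), 300, replace = TRUE),
                levels = c("small", "medium", "large"))
)
d$average_sat_math <- 594 - 2.93 * d$perc_disadvan + rnorm(300, sd = 15)

interact_lm <- lm(average_sat_math ~ perc_disadvan * size, data = d)
names(coef(interact_lm))
# six terms: baseline intercept and slope, two intercept offsets
# (sizemedium, sizelarge), and two slope offsets
```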

# A tibble: 6 × 3
  term                     estimate std_error
  <chr>                       <dbl>     <dbl>
1 intercept                 594.       13.3  
2 perc_disadvan              -2.93      0.294
3 size: medium              -17.8      15.8  
4 size: large               -13.3      13.8  
5 perc_disadvan:sizemedium    0.146     0.371
6 perc_disadvan:sizelarge     0.189     0.323

🤨

Who is baseline?
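Notice the table has no `size: small` row — R uses the *first level* of the factor as the baseline group, so small schools are the baseline here. A toy sketch (not the `MA_schools` data) of how `relevel()` changes the baseline:

```r
# R treats the first factor level as the baseline (reference) group
size <- factor(c("small", "medium", "large"),
               levels = c("small", "medium", "large"))
levels(size)[1]                          # "small" -> the baseline
levels(relevel(size, ref = "large"))[1]  # "large" would now be baseline
```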

Deciphering groups – Small schools

\[\widehat{SAT}_{small} = 594 - 2.93 \times \text{percent disadvan}\]

Deciphering groups – Medium schools

\[\widehat{SAT}_{medium} = (594 - 17.8) + (- 2.93 + 0.146) \times \text{percent disadvan}\]

\[\widehat{SAT}_{medium} = 576.2 - 2.784 \times \text{percent disadvan}\]

Deciphering groups – Large schools

\[\widehat{SAT}_{large} = (594 - 13.3) + (- 2.93 + 0.189) \times \text{percent disadvan}\]

\[\widehat{SAT}_{large} = 580.7 - 2.741 \times \text{percent disadvan}\]

What if they’re not very different?
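One informal way to answer that: compare each slope offset to its standard error, using the values copied from the interaction table:

```r
# Slope offsets and standard errors, copied from the interaction table
offsets <- c(medium = 0.146, large = 0.189)
ses     <- c(medium = 0.371, large = 0.323)
round(offsets / ses, 2)   # medium ~0.39, large ~0.59
# both offsets are well within one standard error of zero,
# so the three slopes look similar -> fit the parallel slopes model
```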

Parallel Slopes

lm(average_sat_math ~ perc_disadvan + size, data = MA_schools)


# A tibble: 4 × 3
  term          estimate std_error
  <chr>            <dbl>     <dbl>
1 intercept       588.       7.61 
2 perc_disadvan    -2.78     0.106
3 size: medium    -11.9      7.54 
4 size: large      -6.36     6.92 

Group equations – Baseline


\[\widehat{SAT}_{small} = 588 - 2.78 \times \text{percent disadvantaged}\]

Group equations – Offsets


\[\widehat{SAT}_{medium} = (588 - 11.9) - 2.78 \times \text{percent disadvan}\]

\[\widehat{SAT}_{medium} = 576.1 - 2.78 \times \text{percent disadvan}\]

\[\widehat{SAT}_{large} = (588 - 6.36) - 2.78 \times \text{percent disadvan}\]

\[\widehat{SAT}_{large} = 581.64 - 2.78 \times \text{percent disadvan}\]